fix: Colab compatibility, NYC TLC dataset, new README and AGENTS.md #1
Open
CLAV88 wants to merge 1 commit into coder2j:main
- Remove os.environ SPARK_HOME cell from all notebooks
(hardcoded local Mac path breaks Colab — pip install pyspark
bundles its own binaries, no SPARK_HOME needed)
- Replace all ./data/ relative paths with /content/data/
(Colab resolves relative paths against a temp kernel directory
that does not exist — absolute paths required)
- Add mode('overwrite') to all df.write cells
  (the default write mode is 'error', which fails on every re-run
  after the first; re-running is the normal learner workflow)
- Add idempotent git clone guard in data bootstrap cells
(exit code 128 on re-run because destination directory already
exists — shutil.rmtree guard makes bootstrap safe to re-run)
- Replace synthetic sample data with NYC TLC Yellow Taxi dataset
(Jan 2023, ~3M rows, 19 columns including fare_amount, tip_amount,
payment_type, datetime fields — more representative of production
data engineering than 5-row synthetic samples)
- Rewrite README.md with stage structure, Colab setup instructions,
per-stage test questions, and embedded SVG diagrams
- Add AGENTS.md for AI-assisted learning discovery and teaching
walkthrough guidance
- Add assets/ folder with 4 SVG diagrams:
partition-vs-table, rdd-vs-dataframe, csv-vs-parquet,
groupby-vs-window
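
The path fix above can be wrapped in a small helper. A minimal sketch (the `data_path` name and the local fallback are illustrative additions; the notebooks themselves simply use `/content/data/` directly):

```python
import os
from pathlib import Path


def data_path(name: str) -> str:
    """Return an absolute path for a dataset file.

    Colab resolves relative paths like './data/x.csv' against a temp
    kernel directory that does not exist, so an absolute base under
    /content is required there; elsewhere, fall back to ./data in the
    current working directory.
    """
    base = Path("/content/data") if os.path.isdir("/content") else Path("data")
    return str((base / name).resolve())
```

`data_path("taxi.parquet")` then yields `/content/data/taxi.parquet` on Colab and an absolute path under the local working directory anywhere else.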
What this PR fixes
This PR makes the tutorial work correctly on Google Colab,
which is how most beginners will run it.
Problems fixed
Environment cell breaks Colab: the `os.environ` SPARK_HOME cell in every notebook hardcodes a local Mac path that does not exist on Colab's VM. Replaced with a `pip install pyspark` setup cell that works on any machine.
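
A setup cell of that kind can be sketched roughly as follows (the `ensure_package` helper is a hypothetical name, not from the PR; a Colab cell would typically just run `!pip install pyspark`):

```python
import importlib.util
import subprocess
import sys
from typing import Optional


def ensure_package(module_name: str, pip_name: Optional[str] = None) -> bool:
    """Install a package with pip only if it is not already importable.

    Returns True when the package was already present. Because
    `pip install pyspark` bundles Spark's own binaries, no SPARK_HOME
    environment variable is needed, on Colab or locally.
    """
    if importlib.util.find_spec(module_name) is not None:
        return True
    subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name or module_name])
    return False
```

The guard just avoids reinstalling on re-run; calling `ensure_package("pyspark")` is a no-op once PySpark is present.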
Relative data paths fail: all `./data/` paths replaced with `/content/data/` absolute paths that Colab can resolve.

Write cells fail on re-run: added `.mode("overwrite")` to all `df.write` cells. The default mode is `error`, which fails on every run after the first; re-running is the normal learner workflow.
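
The failure mode can be reproduced without a Spark runtime. A plain-Python analogue of the save-mode semantics (the `write_dataset` helper is illustrative, not Spark's API):

```python
import shutil
from pathlib import Path


def write_dataset(path: str, rows: list, mode: str = "error") -> None:
    """Plain-Python stand-in for Spark's DataFrameWriter save modes.

    mode='error' (Spark's default, a.k.a. 'errorifexists') refuses to
    touch an existing output directory, which is why a notebook write
    cell fails on every re-run; mode='overwrite' replaces the output
    and makes the cell idempotent.
    """
    out = Path(path)
    if out.exists():
        if mode == "overwrite":
            shutil.rmtree(out)
        else:  # mimic Spark's default behaviour
            raise FileExistsError(f"path {path} already exists")
    out.mkdir(parents=True)
    (out / "part-00000.txt").write_text("\n".join(rows))
```

In the notebooks, the actual fix is the one-line change from `df.write.parquet(...)` to `df.write.mode("overwrite").parquet(...)`.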
Clone fails on re-run: added an existence check before `git clone` calls. Without it, re-running any notebook throws exit code 128.
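
A sketch of such a guard, assuming a hypothetical `fresh_clone` helper (the PR's bootstrap cells use a `shutil.rmtree` guard; the exact code may differ):

```python
import shutil
import subprocess
from pathlib import Path


def fresh_clone(repo_url: str, dest: str) -> None:
    """Clone repo_url into dest, wiping any leftover copy first.

    `git clone` exits with code 128 when dest already exists and is
    non-empty, so removing it first makes a data bootstrap cell safe
    to re-run top to bottom.
    """
    target = Path(dest)
    if target.exists():
        shutil.rmtree(target)
    subprocess.check_call(["git", "clone", repo_url, str(target)])
```

Alternatively, the clone can be skipped entirely when the destination already holds the data; wiping and re-cloning trades a little time for a guaranteed-clean state.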
Dataset replacement
Replaced all synthetic sample data (5-20 rows) with the NYC TLC Yellow Taxi dataset (Jan 2023, ~3M rows, 19 columns). Real financial columns such as `fare_amount`, `tip_amount`, and `payment_type` make every exercise more meaningful and representative of production data work.
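
To illustrate the kind of exercise these columns enable, here is a tip-percentage aggregation computed over three invented rows rather than the real parquet file (per the TLC data dictionary, `payment_type` 1 is credit card and 2 is cash, which records no tips):

```python
from collections import defaultdict

# Tiny in-memory stand-in for three NYC TLC columns (values invented).
trips = [
    {"payment_type": 1, "fare_amount": 12.0, "tip_amount": 2.4},
    {"payment_type": 1, "fare_amount": 30.0, "tip_amount": 6.0},
    {"payment_type": 2, "fare_amount": 9.0, "tip_amount": 0.0},
]

# Average tip percentage per payment type: sum tips and fares per group,
# then divide -- the shape of a typical groupBy exercise on the dataset.
totals = defaultdict(lambda: [0.0, 0.0])
for t in trips:
    totals[t["payment_type"]][0] += t["tip_amount"]
    totals[t["payment_type"]][1] += t["fare_amount"]

tip_pct = {k: round(100 * tip / fare, 1) for k, (tip, fare) in totals.items()}
print(tip_pct)  # {1: 20.0, 2: 0.0}
```

The same aggregation in PySpark is a one-liner over ~3M real rows, which is exactly why the dataset swap makes the exercises feel like production work.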
New files
- `README.md`: rewritten with stage structure, Colab setup instructions, per-stage test questions, and embedded diagrams
- `AGENTS.md`: AI-assisted teaching guidance and a common-error reference for learners using AI tools to work through the tutorial
- `assets/`: 4 SVG diagrams illustrating key concepts

Tested on
Google Colab, Spark 4.0.2, pip-installed pyspark